💻 LSE DS202W 2024: Week 11 - Lab

Author

Shuja Ali

Published

March 25, 2024

⚙️ Setup

📋 Lab Tasks

Part 0

  • Export your chat logs

Part I - Meet a new dataset (15 mins)

No need to wait! Start tackling the action points below when you come to the classroom.

The dataset we are going to use today, Political Apologies across Cultures (PAC), was assembled by the Political Apologies x Cultures project. The data consists of an inventory of political apologies offered by states or state representatives to a collective for human rights violations that happened in the recent or distant past.

🎯 ACTION POINTS

  1. Go to the Political Apologies x Cultures project website, click on Database and then click on Coded Database. This will download a file called PAC_Coded-Apologies_Public-Version-2.xlsx to your computer.

  2. Before opening it in R, take some time to look at the data in its raw format using MS Excel (or Google Sheets). What do you see? Which tabs and columns seem interesting?

  3. Create a df_pac dataframe and pre-process the date_original column. We can use the read_excel function from the readxl package (part of the wider tidyverse ecosystem) to read Excel spreadsheets:

df_pac <- 
  readxl::read_excel("PAC_Coded-Apologies_Public-Version-2.xlsx",
                     sheet="PAC_coding_Template", 
                     .name_repair="minimal") %>% 
  janitor::clean_names() %>%
  drop_na(date_original) %>%
  mutate(date_original=
          case_when(
            str_starts(date_original, "0.0.") ~ str_replace(date_original, "0.0.", "1.1."),
            str_starts(date_original, "0.")   ~ str_replace(date_original, "0.", "1."),
            .default=date_original), 
         date_original=lubridate::dmy(date_original))  %>% 
  arrange(desc(date_original))

df_pac %>% glimpse()
Rows: 396
Columns: 91
$ no                         <dbl> 512, 511, 509, 498, 497, 493, 487, 506, 503…
$ date_original              <date> 2022-07-11, 2022-06-08, 2022-03-23, 2022-0…
$ date                       <chr> "11.07.2022", "08.06.2022", "23.03.2022", "…
$ year_1                     <dbl> 2022, 2022, 2022, 2022, 2022, 2021, 2021, 2…
$ year_2                     <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10,…
$ year_cat                   <chr> "10 =2020-2024", "10 =2020-2024", "10 =2020…
$ description                <chr> "Dutch Minister of Defence Kasja Ollongren …
$ country_s                  <chr> "Netherlands", "Belgium", "United Kingdom o…
$ check_country_s            <chr> "Netherlands", "Belgium", "United Kingdom o…
$ count_s_iso_alpha          <chr> "NLD", "BEL", "GBR", "DNK", "NLD", "CAN", "…
$ count_s_iso_num            <dbl> 528, 56, 826, 208, 528, 124, 528, 703, 376,…
$ region_s_un                <dbl> 150, 150, 150, 150, 150, 19, 150, 150, 142,…
$ region_s_oecd              <chr> "Europe", "Europe", "Europe", "Europe", "Eu…
$ name_send                  <chr> "Kasja Ollongren", "Philippe of Belgium", "…
$ role_send                  <chr> "4 =Minister", "1 =King/Queen/Emperor", "8 …
$ role_send_specify_other    <chr> NA, NA, "Prince", NA, NA, NA, NA, NA, NA, N…
$ off_send                   <chr> "0 =No", "0 =No", "0 =No", "0 =No", "0 =No"…
$ political_party            <chr> "Democrats 66", NA, NA, "Social Democratic …
$ political_color            <chr> "5 =Centre-left", NA, NA, "5 =Centre-left",…
$ count_rec                  <chr> "Bosnia and Herzegovina", "Democratic Repub…
$ check_count_rec            <chr> "Bosnia and Herzegovina", "Democratic Repub…
$ count_r_iso_alpha          <chr> "BIH", "COD", "JAM", "GRL", "IDN", "CAN", "…
$ count_r_iso_num            <chr> "70", "180", "388", "304", "360", "124", "5…
$ region_r_un                <chr> "150", "2", "19", "19", "142", "19", "150",…
$ region_r_oecd              <chr> "Europe", "Africa", "Americas", "Americas",…
$ group_rec_id_1             <chr> "3 =Ethnicity/race", "1 =Nation/citizenship…
$ group_rec_id_2             <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ group_rec_id_3             <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ group_rec_id_specify_other <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ group_rec_1                <chr> "2 =Between-country", "2 =Between-country",…
$ group_rec_1_1              <chr> NA, NA, NA, NA, NA, NA, "1 =Minority", NA, …
$ group_rec_2                <chr> "3 =Both primary and secondary victim-group…
$ context_1                  <chr> "War: Yugoslav wars", "Colonial rule: Belgi…
$ context_2                  <chr> NA, NA, NA, NA, "War: Netherlands - Indones…
$ context_3                  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ context_cat_1              <chr> "War", "Colonial rule", "Slavery", "Treatme…
$ context_cat_2              <chr> NA, NA, NA, NA, "War", NA, NA, NA, "Settler…
$ context_cat_3              <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ hrv_1                      <chr> "20 =Not specified", "6 =Colonization", "4 …
$ hrv_2                      <chr> NA, "8 =Racism/discrimination", NA, NA, NA,…
$ hrv_3                      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, "4 =For…
$ hrv_specify_other          <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ hrv_date_start             <dbl> 1995, 1885, 1562, 1951, 1945, NA, 1985, 196…
$ hrv_date_end               <dbl> 1995, 1960, 1807, NA, 1949, NA, 2014, 2004,…
$ time_hrv_start             <dbl> 27, 137, 460, 71, 77, NA, 36, 55, 65, 145, …
$ time_hrv_end               <dbl> 27, 62, 215, NA, 73, NA, 7, 17, 65, 111, NA…
$ apol_set                   <chr> "5 =Commemoration", "7 =(Diplomatic) visit"…
$ apol_set_specify_other     <chr> NA, NA, NA, "Ceremony with victims", NA, "V…
$ apol_med                   <chr> "1 =Verbal", "1 =Verbal", "1 =Verbal", "1 =…
$ apol_lang                  <chr> "nld =Dutch", "fra =French", "eng =English"…
$ apol_lang_2                <chr> NA, NA, NA, NA, NA, NA, NA, NA, "ara =Arabi…
$ apol_text                  <chr> "2 =Full text", "2 =Full text", "2 =Full te…
$ apol_trans                 <chr> "0 =No translation", "4 =Translation found …
$ apol_light                 <chr> "0 = Completer apology", "0 = Completer apo…
$ gini                       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ wvs_y003                   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ wvs_a189                   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ wvs_a190                   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ wvs_a191                   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ wvs_a192                   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ wvs_a193                   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ wvs_a194                   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ wvs_a195                   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ wvs_a196                   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ wvs_a197                   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ wvs_a198                   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ wvs_a199                   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ wvs_d079                   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ wvs_d080                   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ wvs_g006                   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ wwgi_vac                   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ wwgi_pst                   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ wwgi_gef                   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ wwgi_rq                    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ wwgi_ro_l                  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ wwgi_cc                    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ pts_a                      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ pts_h                      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ pts_s                      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ hof_pdi                    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ hof_idv                    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ hof_mas                    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ hof_uai                    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ hof_ltowvs                 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ hof_ivr                    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ gdp                        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ x                          <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ x_2                        <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ count_s_iso_num_2          <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ x_3                        <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ apol_year                  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
  4. Do an initial exploratory analysis. For example, you could pose the following question to the data: ‘Which country has apologised the most?’
# Here we used the check_* columns as they have been cleaned up by the project team
df_pac %>% count(check_country_s, sort=TRUE)
# A tibble: 84 × 2
   check_country_s                                          n
   <chr>                                                <int>
 1 Japan                                                   60
 2 Germany                                                 30
 3 United States of America                                27
 4 United Kingdom of Great Britain and Northern Ireland    26
 5 Canada                                                  19
 6 Netherlands                                             16
 7 New Zealand                                             12
 8 Republic of Korea                                       11
 9 Norway                                                   9
10 Belgium                                                  8
# ℹ 74 more rows

Or, perhaps: ‘Which country/region has received the most apologies?’

df_pac %>% count(check_count_rec, sort=TRUE)
# A tibble: 88 × 2
   check_count_rec                                          n
   <chr>                                                <int>
 1 *Transnational*                                         71
 2 United States of America                                23
 3 Canada                                                  20
 4 Republic of Korea                                       19
 5 Israel                                                  14
 6 Indonesia                                               12
 7 United Kingdom of Great Britain and Northern Ireland    12
 8 New Zealand                                             11
 9 Australia                                                9
10 Guatemala                                                9
# ℹ 78 more rows
  5. Create an apology_id column. It might be good to have a column with a very short identifier of the apology. We are looking for a short way to identify who apologised to whom and when, something like:
1947-03-04 USA -> MEX

Thankfully, the project team has already done some of the coding for us and converted country names to country codes following the ISO 3166 standard:

df_pac %>% select(count_s_iso_alpha, count_r_iso_alpha)
# A tibble: 396 × 2
   count_s_iso_alpha count_r_iso_alpha
   <chr>             <chr>            
 1 NLD               BIH              
 2 BEL               COD              
 3 GBR               JAM              
 4 DNK               GRL              
 5 NLD               IDN              
 6 CAN               CAN              
 7 NLD               NLD              
 8 SVK               SVK              
 9 ISR               ISR              
10 MEX               MEX              
# ℹ 386 more rows

Therefore, to achieve our goal, we just need to combine the date and the country codes:

df_pac <-
    df_pac %>% 
    mutate(apology_id = paste(date_original, count_s_iso_alpha, "->", count_r_iso_alpha, sep=" "))

Now look at that beautiful apology_id column:

df_pac %>% 
    select(date_original, count_s_iso_alpha, count_r_iso_alpha, apology_id)
# A tibble: 396 × 4
   date_original count_s_iso_alpha count_r_iso_alpha apology_id           
   <date>        <chr>             <chr>             <chr>                
 1 2022-07-11    NLD               BIH               2022-07-11 NLD -> BIH
 2 2022-06-08    BEL               COD               2022-06-08 BEL -> COD
 3 2022-03-23    GBR               JAM               2022-03-23 GBR -> JAM
 4 2022-03-10    DNK               GRL               2022-03-10 DNK -> GRL
 5 2022-02-17    NLD               IDN               2022-02-17 NLD -> IDN
 6 2021-12-13    CAN               CAN               2021-12-13 CAN -> CAN
 7 2021-11-27    NLD               NLD               2021-11-27 NLD -> NLD
 8 2021-11-25    SVK               SVK               2021-11-25 SVK -> SVK
 9 2021-10-29    ISR               ISR               2021-10-29 ISR -> ISR
10 2021-09-28    MEX               MEX               2021-09-28 MEX -> MEX
# ℹ 386 more rows
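Before moving on, it is worth checking that apology_id really is unique — the same sender/receiver pair could, in principle, apologise twice on the same day. A quick sketch, using nothing beyond the dplyr verbs already loaded:

```r
# Any apology_id that appears more than once would need disambiguating,
# e.g. by also appending the `no` column to the identifier
df_pac %>%
  count(apology_id, sort = TRUE) %>%
  filter(n > 1)
```

If this returns zero rows, the identifier is safe to use as a document name later on.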

🗣 CLASSROOM DISCUSSION

Your class teacher will invite you to discuss the following questions with the rest of the class:

  • If our focus today wasn’t on the text describing the apologies, what other questions could we ask of this dataset?

🏡 TAKE-HOME ACTIVITY: Calculate the dataset’s most common Country (Sender) and Country (Receiver) pairs.

Part II - Summarising text data (15 min)

Go over the action points below and stop when your class teacher invites you to discuss something with the rest of the class. The code below is similar to the one used in the Week 10 lecture.

🎯 ACTION POINTS

  1. Build a corpus of text. The first step when performing quantitative text analysis is to create a corpus:
corp_pac <- quanteda::corpus(df_pac, text_field="description")
quanteda::docnames(corp_pac) <- df_pac$apology_id

corp_pac
Corpus consisting of 396 documents and 91 docvars.
2022-07-11 NLD -> BIH.1 :
"Dutch Minister of Defence Kasja Ollongren offered 'deepest a..."

2022-06-08 BEL -> COD.1 :
"Belgian King Filip expresses regret over Belgium's brutal co..."

2022-03-23 GBR -> JAM.1 :
"Prince William expressed his 'profound sorrow' for slavery i..."

2022-03-10 DNK -> GRL.1 :
"Denmark's Prime Minister Mette Frederiksen apologized in per..."

2022-02-17 NLD -> IDN.1 :
"Dutch Prime Minister Mark Rutte offered apologies for system..."

2021-12-13 CAN -> CAN.1 :
"Canada's defence minister apologized to victims of sexual as..."

[ reached max_ndoc ... 390 more documents ]
  2. Calculate and plot the number of tokens per description.
# Note: summary() only returns the first 100 documents by default,
# so we explicitly ask for all of them
plot_df <- summary(corp_pac, n = ndoc(corp_pac)) %>% select(Text, Types, Tokens, Sentences)

g <- (
  ggplot(plot_df, aes(x=Tokens))
  + geom_bar(fill="#C63C4A")
  
  + labs(x="Number of Tokens",
         y="Count",
         title="How many tokens are used to describe the apologies?",
         caption="Figure 1. Distribution of\ntokens in the corpus")
  
  + scale_y_continuous(breaks=seq(0, 10+2, 2), limits=c(0, 10))
  
  # Prettify plot a bit
  + theme_bw()
  + theme(plot.title=element_text(size=rel(1.5)),
          plot.subtitle = element_text(size=rel(1.2)),
          axis.title=element_text(size=rel(1.3)),
          axis.title.x=element_text(margin=margin(t=10)),
          axis.title.y=element_text(margin=margin(r=10)),
          axis.text=element_text(size=rel(1.25)))
)

g

  3. Tokenisation. Observe how each text is now a list of tokens:
# This function extracts the tokens
tokens_pac <- quanteda::tokens(corp_pac)
tokens_pac
Tokens consisting of 396 documents and 91 docvars.
2022-07-11 NLD -> BIH.1 :
 [1] "Dutch"     "Minister"  "of"        "Defence"   "Kasja"     "Ollongren"
 [7] "offered"   "'"         "deepest"   "apologies" "'"         "for"      
[ ... and 12 more ]

2022-06-08 BEL -> COD.1 :
 [1] "Belgian"   "King"      "Filip"     "expresses" "regret"    "over"     
 [7] "Belgium's" "brutal"    "colonial"  "rule"      "in"        "a"        
[ ... and 6 more ]

2022-03-23 GBR -> JAM.1 :
 [1] "Prince"    "William"   "expressed" "his"       "'"         "profound" 
 [7] "sorrow"    "'"         "for"       "slavery"   "in"        "a"        
[ ... and 7 more ]

2022-03-10 DNK -> GRL.1 :
 [1] "Denmark's"   "Prime"       "Minister"    "Mette"       "Frederiksen"
 [6] "apologized"  "in"          "person"      "to"          "six"        
[11] "surviving"   "Greenlandic"
[ ... and 10 more ]

2022-02-17 NLD -> IDN.1 :
 [1] "Dutch"      "Prime"      "Minister"   "Mark"       "Rutte"     
 [6] "offered"    "apologies"  "for"        "systematic" "and"       
[11] "excessive"  "violence"  
[ ... and 7 more ]

2021-12-13 CAN -> CAN.1 :
 [1] "Canada's"   "defence"    "minister"   "apologized" "to"        
 [6] "victims"    "of"         "sexual"     "assault"    ","         
[11] "misconduct" "and"       
[ ... and 4 more ]

[ reached max_ndoc ... 390 more documents ]
tokens_pac[[1]] # to look at just the first one
 [1] "Dutch"      "Minister"   "of"         "Defence"    "Kasja"     
 [6] "Ollongren"  "offered"    "'"          "deepest"    "apologies" 
[11] "'"          "for"        "the"        "Dutch"      "failure"   
[16] "to"         "protect"    "the"        "victims"    "of"        
[21] "the"        "Srebrenica" "Genocide"   "."         
  4. Create a Document-Feature Matrix (dfm). By default, the dfm has a column for each unique token and a row for each document. The values in the matrix are the frequency of each token in each document.
dfm_pac <- quanteda::dfm(tokens_pac) 
dfm_pac
Document-feature matrix of: 396 documents, 1,931 features (99.03% sparse) and 91 docvars.
                         features
docs                      dutch minister of defence kasja ollongren offered '
  2022-07-11 NLD -> BIH.1     2        1  2       1     1         1       1 2
  2022-06-08 BEL -> COD.1     0        0  0       0     0         0       0 0
  2022-03-23 GBR -> JAM.1     0        0  0       0     0         0       0 2
  2022-03-10 DNK -> GRL.1     0        1  0       0     0         0       0 0
  2022-02-17 NLD -> IDN.1     1        1  1       0     0         0       1 0
  2021-12-13 CAN -> CAN.1     0        1  1       1     0         0       0 0
                         features
docs                      deepest apologies
  2022-07-11 NLD -> BIH.1       1         1
  2022-06-08 BEL -> COD.1       0         0
  2022-03-23 GBR -> JAM.1       0         0
  2022-03-10 DNK -> GRL.1       0         0
  2022-02-17 NLD -> IDN.1       0         1
  2021-12-13 CAN -> CAN.1       0         0
[ reached max_ndoc ... 390 more documents, reached max_nfeat ... 1,921 more features ]

We can convert dfm_pac to a dataframe if we like (say, for plotting purposes):

dfm_as_data_frame <- quanteda::convert(dfm_pac, to="data.frame")
dim(dfm_as_data_frame)
[1]  396 1932
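If you prefer tidy tools for plotting, here is a sketch of reshaping that wide data frame into one row per (document, token) pair — assuming the doc_id column name that convert() produces, and that tidyr is loaded with the tidyverse:

```r
# Pivot every column except doc_id into a long token/count format,
# then drop the many zero cells that come from the sparse matrix
dfm_long <- dfm_as_data_frame %>%
  tidyr::pivot_longer(-doc_id, names_to = "token", values_to = "count") %>%
  filter(count > 0)
```

The long format plays nicely with ggplot2, e.g. for plotting per-document token counts.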
  5. Investigate the most frequent tokens in this corpus.
dfm_pac %>% quanteda::topfeatures()
       the          .         of        for         in apologized         to 
       544        310        307        304        229        219        203 
 president        and   minister 
       144        142        130 

It is fun to look at wordclouds:

quanteda.textplots::textplot_wordcloud(dfm_pac)

You might recall from the lecture that we need to remove certain tokens that are not useful for our analysis. Quanteda already has a list of stopwords (but you can also create your own). Let’s remove them and see what happens:

dfm_pac %>% 
  dfm_remove(quanteda::stopwords("en")) %>% 
  topfeatures()
         . apologized  president   minister      prime          ,        war 
       310        219        144        130        101         84         72 
 expressed   japanese          " 
        67         55         51 
dfm_pac %>% 
  dfm_remove(quanteda::stopwords("en")) %>%
  textplot_wordcloud()

🗣️ CLASSROOM-WIDE DISCUSSION: Would you remove any other tokens? Why?

Part III - Extracting the ‘object’ of the apology (30 min)

🧑‍🏫 THIS IS A TEACHING MOMENT

  1. Discover the keywords in context function (kwic).

Before we revisit our tokens, let’s look at an interesting feature of quanteda. We can search for a pattern (a keyword) in our corpus and see the text surrounding it using the kwic function.

For example, after reading the description of apologies, I am curious to see how words with the prefix ‘apolog-’ are used in the corpus:

quanteda::kwic(tokens_pac %>% head(n=10), pattern="apolog*")
Keyword-in-context with 8 matches.                                                                           
 [2022-07-11 NLD -> BIH.1, 10]           Kasja Ollongren offered' deepest |
  [2022-03-10 DNK -> GRL.1, 6] Denmark's Prime Minister Mette Frederiksen |
  [2022-02-17 NLD -> IDN.1, 7]          Prime Minister Mark Rutte offered |
  [2021-12-13 CAN -> CAN.1, 4]                  Canada's defence minister |
  [2021-11-27 NLD -> NLD.1, 5]             Dutch minister Van Engelshoven |
  [2021-11-25 SVK -> SVK.1, 4]                   The Slovakian government |
  [2021-10-29 ISR -> ISR.1, 5]             Israeli President Isaac Herzog |
  [2021-09-28 MEX -> MEX.1, 8]          Manuel Lopez Obrodador offered an |
                                                   
 apologies  | ' for the Dutch failure              
 apologized | in person to six surviving           
 apologies  | for systematic and excessive violence
 apologized | to victims of sexual assault         
 apologized | on behalf of the Dutch               
 apologized | to the thousands of Roma             
 apologized | for the 1956 Kafr Qasim              
  apology   | to the Yaqui indigenous people       

(Your class teacher will explain what the * in the pattern="apolog*" argument does.)

In fact, the * here is a ‘glob’ wildcard: by default, quanteda matches patterns using glob syntax, a simpler cousin of regular expressions (regex), a language just for expressing patterns of strings. You can read more about regex in Chapter 15 of the R for Data Science (2nd edition) book.
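To see the relationship between globs and regexes in base R (a small self-contained illustration; utils::glob2rx translates a glob pattern into the equivalent regex):

```r
words <- c("apology", "apologized", "regret")

# glob2rx() turns the glob "apolog*" into the regex "^apolog"
pattern <- utils::glob2rx("apolog*")
pattern                 # "^apolog"

# The regex matches anything starting with "apolog"
grepl(pattern, words)   # TRUE TRUE FALSE
```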

💡 Let’s be clever about kwic

Let’s try to make this data more interesting for further analysis. In the following steps, we will:

  • use the power of kwic to try to extract just the object of the apology
  • build a new corpus out of this new subset of text data
  • remove unnecessary tokens (stop words + punctuation)
  2. A closer look at the output of kwic. Pay close attention to the columns produced by this function. Note that, this time, we increased the window of tokens that show up before and after the keyword:
quanteda::kwic(tokens_pac %>% head(n=10), pattern="apolog*", window=40) %>% as.data.frame()
                  docname from to
1 2022-07-11 NLD -> BIH.1   10 10
2 2022-03-10 DNK -> GRL.1    6  6
3 2022-02-17 NLD -> IDN.1    7  7
4 2021-12-13 CAN -> CAN.1    4  4
5 2021-11-27 NLD -> NLD.1    5  5
6 2021-11-25 SVK -> SVK.1    4  4
7 2021-10-29 ISR -> ISR.1    5  5
8 2021-09-28 MEX -> MEX.1    8  8
                                                          pre    keyword
1 Dutch Minister of Defence Kasja Ollongren offered ' deepest  apologies
2                  Denmark's Prime Minister Mette Frederiksen apologized
3                     Dutch Prime Minister Mark Rutte offered  apologies
4                                   Canada's defence minister apologized
5                              Dutch minister Van Engelshoven apologized
6                                    The Slovakian government apologized
7                              Israeli President Isaac Herzog apologized
8          President Andres Manuel Lopez Obrodador offered an    apology
                                                                                                                                                post
1                                                                        ' for the Dutch failure to protect the victims of the Srebrenica Genocide .
2                                                      in person to six surviving Greenlandic Inuit who were taken from their families as children .
3                                                                 for systematic and excessive violence in Indonesia's 1945-49 war of independence .
4                                                                       to victims of sexual assault , misconduct and discrimination in the military
5 on behalf of the Dutch government to the transgender community for forced sterilization of transgenders who wished to receive sex-change surgery .
6                                                                                 to the thousands of Roma women who have been forcibly sterilized .
7                                                                                                                   for the 1956 Kafr Qasim massacre
8                                                                           to the Yaqui indigenous people for wrongdoings against their ancestors .
  pattern
1 apolog*
2 apolog*
3 apolog*
4 apolog*
5 apolog*
6 apolog*
7 apolog*
8 apolog*

Note: the info we care about the most is the column post - it contains the text immediately after a match.

This is good, but there is a downside to the keyword we used: not all entries have the term apolog* in their description. You can confirm that by comparing dim(df_pac) with the dimensions of kwic(tokens_pac, pattern="apolog*", window=40). Whereas the original dataset has 396 records, the kwic output has only 270.

  3. Try adding a more complex pattern. Here, we combine multiple keywords using the | operator, which means we are looking for any of the keywords in the pattern.
df_kwic <- 
  quanteda::kwic(tokens_pac,
                 pattern="apolog*|regre*|sorrow*|recogni*|around*|sorry*|remorse*|failur*",
                 window=40) %>%
  as.data.frame()
dim(df_kwic)
[1] 355   7

We still seem to be losing some documents, but we are getting closer to what we want.

💡 Take a look at View(df_kwic)

  4. Handling duplicates. Some rows are repeated because of multiple pattern matches in the same text:
df_kwic %>% group_by(docname) %>% filter(n() > 1)
# A tibble: 38 × 7
# Groups:   docname [19]
   docname                  from    to pre                 keyword post  pattern
   <chr>                   <int> <int> <chr>               <chr>   <chr> <fct>  
 1 2022-07-11 NLD -> BIH.1    10    10 "Dutch Minister of… apolog… "' f… apolog…
 2 2022-07-11 NLD -> BIH.1    15    15 "Dutch Minister of… failure "to … apolog…
 3 2020-11-30 NLD -> NLD.1     4     4 "The Dutch governm… apolog… "to … apolog…
 4 2020-11-30 NLD -> NLD.1    22    22 "The Dutch governm… recogn… "."   apolog…
 5 2017-12-20 MEX -> MEX.1     4     4 "The Mexican gover… apolog… "to … apolog…
 6 2017-12-20 MEX -> MEX.1    11    11 "The Mexican gover… recogn… "tha… apolog…
 7 2014-04-09 JPN -> PHL.1    11    11 "Japanese Ambassad… apology "\" … apolog…
 8 2014-04-09 JPN -> PHL.1    16    16 "Japanese Ambassad… remorse "\" … apolog…
 9 2010-03-30 SRB -> BIH.1     7     7 "The Serbian parli… apolog… "for… apolog…
10 2010-03-30 SRB -> BIH.1    11    11 "The Serbian parli… failure "to … apolog…
# ℹ 28 more rows

Here is how we will deal with these duplicates: let’s keep the one with the longest post text. This is equivalent to selecting the one with the earliest from value in the dataframe above.

df_kwic <- df_kwic %>% arrange(from) %>% group_by(docname) %>% slice(1) 
dim(df_kwic)
[1] 336   7

Note: This is a choice! There is no absolute objective way to handle this case. Would you do anything differently?
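As one hedged alternative, applied to the kwic output before the slice(1) step above: pick the match whose post string is literally the longest, rather than relying on the from position:

```r
# slice_max() keeps, per document, the row with the longest post text;
# with_ties = FALSE guarantees exactly one row per docname
df_kwic_alt <- df_kwic %>%
  group_by(docname) %>%
  slice_max(nchar(post), n = 1, with_ties = FALSE) %>%
  ungroup()
```

The two approaches usually agree, but can differ when a match sits close to the end of a short description.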

🏠 TAKE-HOME (OPTIONAL) ACTIVITY: The original data frame has 396 rows, but now we have 336. How would you change the pattern to avoid excluding data from the original data frame? (Note: I do not have a ready solution to this! Feel free to share yours on #help-labs)

  5. Go back to pre-processing the data. Now that we have a new dataframe, we can create a more robust workflow: produce a new corpus, handle the tokens, create a dfm (bag of words), optionally convert it to a TF-IDF matrix, and plot a wordcloud:
corp_pac <- corpus(df_kwic, text_field="post", docid_field="docname")

my_stopwords <- c(stopwords("en"))

tokens_pac <- 
  # Get rid of punctuations
  tokens(corp_pac, remove_punct = TRUE) %>% 
  # Get rid of stopwords
  tokens_remove(pattern = my_stopwords) %>%
  # Use multiple ngrams
  # The tokens will be concatenated by "_"
  tokens_ngrams(n=1:3)

# Try to run the code below with and without the `dfm_tfidf` function
dfm_pac <- dfm(tokens_pac) # %>% dfm_tfidf()
textplot_wordcloud(dfm_pac)

dfm_pac %>% topfeatures()
         war        world    world_war       people           ii       war_ii 
          55           34           32           30           28           27 
world_war_ii      victims     massacre    suffering 
          27           25           24           22 
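If you are unsure what dfm_tfidf does, here is a tiny base-R illustration of the default weighting quanteda uses (term count multiplied by log10 of N over document frequency) — a sketch for intuition, not the package implementation:

```r
# Toy counts: 3 documents x 2 terms
counts <- matrix(c(2, 0,
                   0, 1,
                   1, 1),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(paste0("doc", 1:3), c("war", "massacre")))

n_docs   <- nrow(counts)
doc_freq <- colSums(counts > 0)          # in how many docs each term occurs
idf      <- log10(n_docs / doc_freq)     # inverse document frequency
tf_idf   <- sweep(counts, 2, idf, `*`)   # weight each column by its idf

round(tf_idf, 3)
```

Terms that appear in every document get an idf of zero, which is why common words fade out of the TF-IDF wordcloud.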

🗣️ CLASSROOM DISCUSSIONS:

  • In what ways is the wordcloud above fundamentally different from the one we did before?
  • Do you sense a theme in the words above? What is it?

Part IV. Dimensionality Reduction + Clustering (30 min)

Note

You will likely not have time to finish this section in class. If that is the case, you can finish it at home. If any questions arise outside of class, please use the #help-labs channel on Slack.

Instead of running k-means or any other clustering algorithm on the full dfm, let’s reduce the number of features of our dataset. This saves memory and makes the process run faster.

🎯 ACTION POINTS

  1. You know about PCA - we’ve been playing with this linear dimensionality reduction technique for a while now. This time, we want to show you an alternative method called Latent Semantic Analysis (LSA). The linear algebra behind it is a bit more complex, but the idea is the same: we want to reduce the number of features (words) in our dataset to just a few dimensions - even if that comes at the cost of losing some interpretability.

The quanteda.textmodels package has a function called textmodel_lsa that does this for us. We will use it to reduce the number of features to 3 dimensions:

df_lsa <- quanteda.textmodels::textmodel_lsa(dfm_pac %>% dfm_tfidf(), nd=3)$docs %>% as.data.frame()

head(df_lsa)
                                              V1           V2           V3
1957-01-01 JPN -> AUS.1             0.0005528981 -0.001251695 -0.002648151
1957-01-01 JPN -> MMR.1             0.0027806416 -0.006119656 -0.006830444
1965-06-22 JPN -> KOR.1             0.0138885554 -0.026908436 -0.050978716
1974-04-09 PAK -> BGD.1             0.0070112979 -0.014150576  0.001994939
1982-08-26 JPN -> *Transnational*.1 0.0138860542 -0.030833950 -0.041480850
1984-09-06 JPN -> KOR.1             0.0010018834 -0.002204787 -0.006937128
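Under the hood, textmodel_lsa is essentially a truncated singular value decomposition (SVD) of the (weighted) document-feature matrix. A minimal base-R sketch of the same idea, on a random stand-in matrix:

```r
set.seed(1)
X <- matrix(rpois(30, lambda = 1), nrow = 6)  # stand-in for a tiny 6x5 dfm

s <- svd(X)                                   # singular value decomposition
docs_3d <- s$u[, 1:3] %*% diag(s$d[1:3])      # documents projected onto 3 dims

dim(docs_3d)  # 6 documents, 3 dimensions
```

The first singular values capture the directions of greatest variation, so keeping only three of them is a (lossy) summary of the whole matrix.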
  2. Visualise it. Let’s plot the 3 dimensions of the LSA output:
# Let's treat you to an interactive plot, while we are at it
# (plot_ly() comes from the plotly package)
plot_ly(data = bind_cols(df_lsa, df_kwic), 
        x = ~V1, 
        y = ~V2, 
        z = ~V3,
        size = 3,
        alpha = 0.7,
        type="scatter3d", 
        mode="markers", 
        text=~paste('Doc ID:', docname))
  3. Investigate the clear outlier:
# Search for this entry in the original data frame
df_pac %>% 
        filter(apology_id == "1995-07-01 JPN -> *Transnational*") %>%
        pull(description) %>% 
        print()
[1] "Japanese Prime Minister Tomiichi Murayama apologized to the roughly 200,000 women who were put into brothels by Japanese forces to serve as sex slaves or “comfort women” and sets up a private “Asian Women’s Fund” to deal with reparations. "
  4. How many clusters are there in this dataset? Here is a new trick for you: the NbClust package. It implements 30 indices for determining the number of clusters in a dataset and offers a voting system to decide the best number. This makes it a little less subjective than the elbow method.
res.nbclust <- df_lsa %>% select(V1, V2, V3) %>%
    scale() %>%
    # You can change the distance metric here
    NbClust(distance = "euclidean",
            min.nc = 2, max.nc = 10, 
            method = "complete", index ="all") 

*** : The Hubert index is a graphical method of determining the number of clusters.
                In the plot of Hubert index, we seek a significant knee that corresponds to a 
                significant increase of the value of the measure i.e the significant peak in Hubert
                index second differences plot. 
 

*** : The D index is a graphical method of determining the number of clusters. 
                In the plot of D index, we seek a significant knee (the significant peak in Dindex
                second differences plot) that corresponds to a significant increase of the value of
                the measure. 
 
******************************************************************* 
* Among all indices:                                                
* 6 proposed 2 as the best number of clusters 
* 2 proposed 3 as the best number of clusters 
* 7 proposed 4 as the best number of clusters 
* 3 proposed 6 as the best number of clusters 
* 2 proposed 7 as the best number of clusters 
* 3 proposed 9 as the best number of clusters 
* 1 proposed 10 as the best number of clusters 

                   ***** Conclusion *****                            
 
* According to the majority rule, the best number of clusters is  4 
 
 
******************************************************************* 

The indices have voted! They say the best number of clusters is 4.
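If you want to try NbClust outside this lab, here is a hedged, self-contained sketch on the built-in iris measurements. Running a single index (silhouette, an arbitrary choice for illustration) keeps the output short:

```r
library(NbClust)

# Run a single index (silhouette) on the scaled iris measurements
res <- NbClust(scale(iris[, 1:4]),
               distance = "euclidean",
               min.nc = 2, max.nc = 6,
               method = "complete",
               index = "silhouette")

res$Best.nc  # number of clusters proposed by this index, with its score
```

With `index = "all"`, as in the lab code above, NbClust instead runs its full battery of indices and reports the majority vote.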

  1. Apply topic modelling. Instead of k-means, we will use another unsupervised technique called topic modelling. As the name implies, this clustering technique is better suited to text data. The most popular family of such models is Latent Dirichlet Allocation (LDA). We will not go into how it works, but you can read more about it here.
# Drop documents with zero tokens, then fit a 4-topic LDA
# (fixing the seed makes the fitted topics reproducible across runs)
tmod_lda <- topicmodels::LDA(dfm_pac %>% dfm_subset(ntoken(dfm_pac) > 0),
                             k = 4, control = list(seed = 1))

## What are the topics mostly associated with each cluster?
tmod_lda %>% topicmodels::terms(10)
      Topic 1    Topic 2     Topic 3        Topic 4    
 [1,] "violence" "suffering" "war"          "war"      
 [2,] "victims"  "people"    "world"        "people"   
 [3,] "state"    "wwii"      "world_war"    "committed"
 [4,] "war"      "victims"   "war_ii"       "past"     
 [5,] "nazi"     "role"      "world_war_ii" "victims"  
 [6,] "actions"  "jews"      "ii"           "south"    
 [7,] "terror"   "women"     "massacre"     "suffering"
 [8,] "people"   "community" "people"       "crimes"   
 [9,] "republic" "roma"      "americans"    "massacre" 
[10,] "forced"   "caused"    "women"        "caused"   

Assign each row to a topic:

# Extract each document's most likely topic and tidy it into a data frame
df_topics <- tmod_lda %>% 
  topicmodels::topics() %>% 
  tibble::enframe(name = "docname", value = "topic") %>% 
  mutate(topic = as.factor(topic))

df_kwic <- left_join(df_kwic, df_topics, by="docname")
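If you would like to experiment with LDA beyond this corpus, here is a minimal self-contained run on the AssociatedPress document-term matrix that ships with topicmodels (the subset size, k, and seed are arbitrary choices for illustration):

```r
library(topicmodels)

data("AssociatedPress", package = "topicmodels")

# Fit a 2-topic LDA on the first 50 AP articles (seed fixed for reproducibility)
toy_lda <- LDA(AssociatedPress[1:50, ], k = 2, control = list(seed = 1))

terms(toy_lda, 5)     # top 5 terms per topic
topics(toy_lda)[1:5]  # most likely topic for the first 5 documents
```

The same two accessors, terms() and topics(), are what we use on the PAC model above.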

Let’s plot the clusters. This time, I’m adding the post column to the tooltip so you can read the description of each apology:

# Thanks, ChatGPT, for providing the regex below!
# It breaks the text into lines of 30 characters
# but avoids breaking words in the middle
better_tooltip <- 
        paste('Doc ID:',
              df_kwic$docname, 
              '\nDescription:\n', 
              str_replace_all(df_kwic$post, "(.{1,30})(\\s|$)", "\\1\n"))
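As an aside, stringr ships a helper, str_wrap, that does near-equivalent wrapping without a hand-rolled regex. This toy sketch shows both approaches side by side (the string is invented; the width of 30 matches the regex above; the outputs differ only in trailing whitespace handling):

```r
library(stringr)

s <- "Japanese Prime Minister Tomiichi Murayama apologized"

# The regex approach: break into chunks of at most 30 characters,
# always splitting at whitespace, appending "\n" after each chunk
str_replace_all(s, "(.{1,30})(\\s|$)", "\\1\n")

# The built-in alternative: returns lines joined by "\n", up to 30 chars wide
str_wrap(s, width = 30)
```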

plot_ly(data =  bind_cols(df_lsa, df_kwic), 
        x = ~V1, 
        y = ~V2, 
        z=~V3,
        size = 3,
        color = ~topic,
        type="scatter3d", 
        mode="markers", 
        text=better_tooltip)

🤔 It looks like the 3D representation does not truly encapsulate the topics/clusters in the same region of the space. Why do you think that is?

  1. Which tokens best describe each cluster? Let’s use the textstat_keyness function to find out:

We can use the concept of keyness to score words in relation to a target vs a reference group. Read more about keyness here.

# Change the cluster number to see the results for each cluster
selected_cluster <- 1

tstat_key <- textstat_keyness(dfm_pac, 
                              measure = "chi2", 
                              target = case_when(is.na(df_kwic$topic) ~ FALSE, 
                                                 df_kwic$topic == selected_cluster ~ TRUE,
                                                 .default = FALSE))
textplot_keyness(tstat_key, labelsize=2)

Plot a word cloud with just the target group:

textplot_wordcloud(tstat_key, comparison=FALSE, min_count=2)
Warning in wordcloud(x, min_size, max_size, min_count, max_words, color, :
victims_white_terror could not be fit on page. It will not be plotted.
Warning in wordcloud(x, min_size, max_size, min_count, max_words, color, :
violence_catalan_general could not be fit on page. It will not be plotted.
Warning in wordcloud(x, min_size, max_size, min_count, max_words, color, :
women_forced_work could not be fit on page. It will not be plotted.

Wordcloud to compare target vs reference:

textplot_wordcloud(tstat_key, comparison=TRUE, min_count=2)
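To see keyness in isolation, here is a tiny self-contained example: four made-up toy documents, with the first two marked as the target group, so words over-represented in them get high chi-squared scores:

```r
library(quanteda)
library(quanteda.textstats)

# Four toy documents; the first two form the target group
toy <- dfm(tokens(c("war war victims", "war crimes",
                    "peace treaty", "peace talks")))

# Score each word's association with the target group vs the rest
key <- textstat_keyness(toy, target = c(TRUE, TRUE, FALSE, FALSE),
                        measure = "chi2")
head(key)  # "war" appears only in the target docs, so it ranks at the top
```

The lab code above does the same thing at scale, using cluster membership as the target indicator.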

🗣️ CLASSROOM DISCUSSIONS

This is your last DS202W lab! Take some time to reflect on how the text mining application we did today encapsulates many of the main concepts we have learned in this course.

Before you say goodbye to your classmates and teacher, discuss the following questions:

  • Why do you think we deliberately showed you new dimensionality reduction (LSA) and clustering techniques (LDA) today? Why not just stick to PCA and k-means?
  • Why do you think there are so many clustering techniques out there? See, for example, this table from the Python library, scikit-learn.
  • Would you use any of the techniques we learned today in your own research? Why or why not?